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Abstract 

Background: Lymph node invasion is one of the most powerful clinical factors in cancer prognosis. However, 
molecular level signatures of their correlation are remaining poorly understood. Here, we propose a new approach, 
monotonically expressed gene analysis (MEGA), to correlate transcriptional patterns of lymph node invasion related 
genes with clinical outcome of breast cancer patients. 

Results: Using MEGA, we scored all genes with their transcriptional patterns over progression levels of lymph node 
invasion from 278 non-metastatic breast cancer samples. Applied on 65 independent test data, our gene sets of 
top 20 scores (positive and negative correlations) showed significant associations with prognostic measures such as 
cancer metastasis, relapse and survival. Our method showed better accuracy than conventional two class 
comparison methods. We could also find that expression patterns of some genes are strongly associated with 
stage transition of pathological T and N at specific time. Additionally, some pathways including T-cell immune 
response and wound healing serum response are expected to be related with cancer progression from pathway 
enrichment and common motif binding site analyses of the inferred gene sets. 

Conclusions: By applying MEGA, we can find possible molecular links between lymph node invasion and cancer 
prognosis in human breast cancer, supported by evidences of feasible gene expression patterns and significant 
results of meta-analysis tests. 



Background 

The presence of lymph node invasion is one of the 
strongest indicators for prognoses of distant metastasis 
and survival in most cancers [1,2]. In the multi-step pro- 
cess of cancer metastasis development, invasion into a 
vascular or a lymphatic system has generally been 
believed to be a key step of tumor cell dissemination 
[3-5]. Once tumor cells acquire abilities of intravasation 
and survival in an unfavorable vascular environment, 
they circulate around the whole body parts to form new 
tumors at the secondary site [6]. While the exact 
mechanisms of cancer metastasis through blood vessels 
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and lymph nodes are still being studied, it is necessary 
to explain the processes in a genetic level as a key factor 
of cancer patients' prognosis. 

Many researchers have devoted their efforts to under- 
stand lymph node invasion in breast cancers, because 
regional lymph nodes are frequently observed as the 
first site of metastasis [7]. Survival analyses with clinical 
features showed that lymph node status is generally 
marked as a top significant factor among conventional 
clinical features [8-10]. Studies of finding molecular 
markers using genome-wide expression profiles identi- 
fied various genetic signatures for prediction of lymph 
node and distant metastasis [11-19]. However, the asso- 
ciations between conventional clinical features including 
tumor size, lymph node involvement and distant metas- 
tasis (TNM staging [20]) and prognosis are not yet 
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identified in a genetic level Moreover, the existence of a 
common gene set for lymph node metastasis in a tran- 
scriptional level is unclear [21]. 

So far, £-test based differential expression analysis or 
clustering methods between lymph node negative and 
positive samples have been used to detect corresponding 
gene sets [17-19,21]. Although these methods are 
straightforward and intuitive, there are several inherent 
problems in them. First, direct comparison within two 
classes (lymph node negatives and positives) may sim- 
plify the subtle changes over cancer progression. 
Usually, four-stage pathological N (N0-N3) is used to 
indicate the degree of lymph node invasion in breast 
cancer; NO denotes no lymph node invasion observed. 
Regarding the expression values as longitudinal data to 
find patterns over lymph node progression might benefit 
from utilizing known biomedical information. For 
instance, a gene whose expression is significantly high 
only at a certain stage (e.g. N2) is hardly accepted as a 
closely related gene from current metastasis model. 
However, a two-class comparison (e.g. NO vs. others) 
would mark it differently expressed. Second, the effect 
of a factor (e.g. lymph node invasion) should be sepa- 
rated from the effect of the others (e.g. tumor size or 
histological subtype). These factors are generally not 
independent and will lead to false findings unless care- 
fully analysed. And, the validation of inferred gene sig- 
natures should be performed on sufficient number of 
independent sets in a strict statistical manner. The data 
statistics and characteristics including inherent biases 
should be recognized, appropriately treated and be prop- 
erly analyzed in a meta-analytic way. 

There are several statistical models applicable for mul- 
tivariate correlation scoring (instead of two-class based 
scoring). Linear/Non-linear (multiple) regression and 
analysis of variance models (two-way ANOVA and 
MANOVA) have been widely used in various fields. 
Both models (linear and non-linear), however, have a 
few weaknesses; a gene expression pattern over lymph 
node progression is not necessarily linear, and the data 
has too few time points to be assessed in a non-linear 
way. ANOVA models are usually used to test if there is 
a significant difference among the mean values, so it is 
not robust to inconsistent fluctuations of expression 
values. In time series analyses, autoregressive moving 
average model and its variants (ARMA, ARMAX and 
ARIMA) are widely used especially in electronic engi- 
neering and system identification fields, and some unit 
root tests (for stationarity test in time series including 
Augmented Dickey Fuller test [22]) have been used in 
statistics and econometrics as well. However, there are a 
few difficulties in adapting these models to our problem; 
the number of time points is very few, intervals are not 
regular and the stage is a pseudo-time. After reviewing 



the conventional models, we developed a new multivari- 
ate correlation measure specially designed for non-linear 
and small data point analysis. Nevertheless, the conven- 
tional models were applied as well and tested to com- 
pare with our measure and two-class based analyses. 

Our method, monotonically expressed gene analysis 
(MEGA), scores gene expression patterns with their 
non-linear monotonicity over a stage progression of 
interest. It accumulates all the normalized expressional 
differences between two consecutive stages (see Meth- 
ods). If the direction of expressional change is consis- 
tently positive or negative, the score increases; 
otherwise, the sum of differences will be cancelled out. 
Because there are two non-independent factors (stage T 
and N), one variable should be fixed while the other 
variable is being used. In MEGA, a two dimensional 
matrix is constructed, each dimension of which is com- 
posed of four points (N0-N3 and T1-T4, TO is 
excluded due to the lack of data) generating totally 16 
data points per a gene. So, applying the scoring function 
to each row or column represents calculating the cumu- 
lative expressional changes over one factor while the 
other is fixed. MEGA also has a weight parameter to 
emphasize a specific stage transition (e.g. Nl— > N2) to 
capture genes activated or repressed in a particular time 
range. After calculating scores, top k genes are collected 
and named N-wise monotonically expressed genes (N- 
MEG) or T-wise monotonically expressed genes (T- 
MEG) depending on which factor is used for the analy- 
sis. Validation of inferred gene sets can be done in a ret- 
rospective way to see how accurately the gene sets 
classify prognostic outcomes in other independent data. 
P-values from each test data are integrated by meta-ana- 
lysis to report more confident accuracy of the gene sets. 
This is basically one of the most unbiased ways for eval- 
uating usefulness of inferred gene signatures. 

If the gene sets show consistence and confident accu- 
racy, a series follow-up analysis can be used for reason- 
ing biological meaning (e.g. common pathways or 
transcription factors). First, gene set analysis can dis- 
cover some biological pathways involving in metastasis 
progression. Considering pathways instead of individual 
genes as an acting unit of biological phenomena 
explains how different gene sets are sometimes asso- 
ciated with same conditions. And we can find more suc- 
cinct way to describe the whole processes. Second, the 
fact that the genes show similar expression patterns as 
the cancer metastasis progresses leads us to a hypothesis 
that some common transcription factors play a crucial 
role in the process. Here, all the genes are not necessa- 
rily causative; rather, they are effect from changes of a 
fewer number of genes in upper hierarchy. In this case, 
finding frequently represented motifs from the promoter 
regions of the gene sets might be a good analysis for 
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discovering the transcription factors. This would be 
more powerful information in practical applications 
such as pharmaceutical research and patient treatment. 

Results 

Totally four gene sets of size 20 are constructed from 
278 breast tumor gene expression data (expO database) 
by applying monotonically expressed gene analysis 
(MEGA). They are N-wise monotonically expressed 
genes (N-MEG) and T-wise monotonically expressed 
genes (T-MEG), which are further divided into positive 
and negative correlation sets. Given these four gene sets 
(N-MEG+, N-MEG-, T-MEG+, and T-MEG-), we tested 
on65 independent breast cancer prognosis data sets 
downloaded from ONCOMINE database (See Methods 
for details) how much the expression values of the 
genes are correlated with prognostic outcomes. 

Lymph node-wise monotonically expressed genes (N- 
MEG) 

The result of meta-analysis test with N-MEG+ and N- 
MEG- is shown in Figure 1. Two gene sets are divided 
by the vertical separator. Three major analysis types 
(PRG, STG, and GRD) and seven minor analysis types 
are denoted in the first and second columns. Each row 
corresponds to an experiment and each column corre- 
sponds to a gene. So a value in a cell is a p-value of a 
gene in the corresponding experiment. Cells are colored 
blue when the genes are significantly up-regulated at the 
study, yellow when down-regulated, black when not sig- 
nificantly regulated, and grey when the genes could not 
be found in the corresponding experiments; here, up- 
regulation means genes are up-regulated in bad-prog- 
noses, higher stages, and higher grades. 

It is easily shown that N-MEG + genes are positively 
correlated with worse prognoses, higher tumor stages 
and higher tumor grades. Similarly, N-MEG- is nega- 
tively correlated. From the p-value matrix, we can calcu- 
late integrated p-values using three meta-analysis 
methods over ten test classes (Figure 2). It is easily 
found that the N-MEG is highly significant in all types 
of prognosis analyses (p-values less than 10" 14 in any 
methods). Except the test for stage M (current status of 
metastasis), all p-values were less than 0.01. The study 
of stage M is designed for elucidating differences of 
gene expression profile between primary tumors and 
metastatic tumors. The conceptual difference from the 
prognosis study of metastasis is that while the former 
describes the status of 'metastasis occurred', the latter 
describes 'metastasis will occur'. The results of stage N 
and stage T were intermediately significant; five of the 
seven studies in the stage N are two class comparisons 
(NO vs. others). Correlations with the tumor grade stu- 
dies were extremely significant. It is also shown that the 



Stouffer's Z method gives relatively more conservative 
results. As the Stouffers Z method has been proven to 
be more robust to a few extreme values [23] and cor- 
rectable here (see Methods), we will use the corrected 
version of this method for rest of the study. 

In a comparison with gene sets from previous work, 
the N-MEG showed the highest association with cancer 
prognoses (Figure 3). Gene sets from a multiple regres- 
sion and a two-way ANOVA model followed it and 
other two gene sets (Suzuki et al and Ellsworth et al) 
showed relatively lower significance. This result implies 
that the pattern based methods (MEGA, two-way 
ANOVA and multiple regression models) are more 
effective than two class direct comparison methods (t- 
test and clustering) in finding prognosis associated 
genes. On the other hand, ANOVA and Suzuki set 
showed the best score with the N stage. Like we already 
mentioned, most of the existing N stage test sets are 
based on a £-test within two classes, which is the same 
method as what Suzuki et al used. In other analysis 
types including M and T stage grouping and tumor 
grade, we could not find significant differences among 
five methods. Abba set was also tested even though the 
gene set was already had a selection step using prog- 
nosis data (selecting 46 top ranked genes from 300 
genes, see Methods). The test showed that our gene set 
was comparable to it (better in metastasis, relapse and 
overall prognosis) in spite of a significant degree of 
unfairness. 

Overall aspects of gene expression progression along 
the N stage give explicit explanations of differences 
among the candidate gene sets (Figure 4). In the N- 
MEG and multiple regression model-based gene sets 
show consistent increase or decrease along the N stage 
independent to the T stage (Figure 4A and 4B). How- 
ever, gene sets from two class direct comparison meth- 
ods (£-test and Mann- Whitney test) show certain degree 
of inconsistency and discrepancy between lymph node 
phenotypes and gene expression patterns (Figure 4D, 
4E, and 4F). This result shows that those gene sets 
(lymph node positive vs. negative) may contain false 
positives from abstracting detailed pattern information, 
and also implicates the reason why N-MEG showed 
relatively high significance in the prognosis test. 

Classification and survival analysis 

To show the classification power of the N-MEGs, we con- 
ducted a test for 5 -year metastasis free survival data from 
Wang et al [15]. Because the meaning of N-MEG + and N- 
MEG- is so clear, we scored the sum of row-normalized z- 
scores of corresponding genes; adding for 20 N-MEG+ 
genes and subtracting for 20 N-MEG- genes. For the 286 
primary breast samples (91 metastasis in 5-years), the 
mean score was nearly zero (6.5xl0~ 13 ) and the standard 
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Figure 1 N-MEG with their meta-analysis test result. Here, 20 N-MEG+ and 20 N-MEG- genes are tested. Each column corresponds to a 
specific gene, and each row to an experiment in the ONCOMINE test set. Test set of 65 experiments are classified into three major classes (PRG, 
STG, and GRD), each of which are subdivided into several minor classes. Blue color denotes up-regulation, and yellow color denotes down- 
regulation. Experiments with an asterisk (*) denote they used the expO database, and were excluded from further analyses. PRG=prognosis, 
MP= metastatic prognosis, RLP=relapse prognosis SVP=survival prognosis, STG=stage, M=M stage, N=N stage, G=stage grouping, T=T stage, 
GRD=tumor grade. 



deviation was 7.8. From the 51 patients whose scores were 
bigger than the mean plus one standard deviation, 37 had 
metastasis in 5-years giving 0.35 of sensitivity and 0.92 of 
specificity. The overall accuracy was 0.71. An ROC curve 
was drawn to compare the N-MEGs with other gene sets 
(Figure 5A). The N-MEGs showed the best classification 
power. Interestingly, while the two statistical approaches 



using stage progression (multiple regression and two-way 
ANOVA) managed to prove a certain degree of usefulness, 
the studies using two class comparisons did not. Although 
the result may be further improved by other fancy classi- 
fiers with optimization procedures, we can tentatively con- 
clude that observing the signatures of stage progression 
gives better results. A set of area under curve (AUC) were 
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distinct metastasis free survival and hazard rate in Kaplan 
Meier estimation (Figure 5 C and D). 

T-wise monotonically expressed genes (T-MEG) and 
comparison with N-MEG 

T-MEG (n=40, 20 T-MEG+ and 20 T-MEG-) were also 
significantly correlated with breast cancer prognosis 
including metastasis and relapse, but the significance 
was generally worse than N-MEG (Table 1). In the 
prognosis of metastasis studies, both of the T-MEG + 
and T-MEG- were significant (p-values of 4.3xl0~ 8 
and3.1xl0~ 3 respectively), but they were not as effective 
as N-MEG (p-values of 1.2x10 15 andl.4xl0" 6 ). This 
result agrees with the previously known pathological 
facts; both of the degree of lymph node invasion and 
tumor size are important in predicting metastasis prob- 
abilities, while the former gives more direct evidences. 
We can also notice that tumor size related gene were 
either not significant (in prognosis of survival and 
tumor stages) or less significant than lymph node inva- 
sion related genes (in prognosis of relapse and tumor 
grade). 

The distinct characteristics between the two tumoral 
features might be tumor tissue specific. Breasts are not 
essential organs for personal survival. So even though a 
tumor has grown to be large, the cancer is not a fatal 
disease unless the tumor has been spread to other 
organs. In this case, mastectomy would be effective for 




Overall Metastasis Relapse Survival Overall M Stage N Stage T Stage Stage Grade (/10) 



Prognosis Prognosis Prognosis Prognosis Stage Grouping 

Figure 3 Comparison with previous studies. Each value corresponds to a sum of log-odd scores from up and down regulated gene sets. In 
prognosis analyses, pattern based methods (MEGA, multiple regression and ANOVA models) showed better results than two-class comparison 
methods (Paired t-test in Suzuki and Mann-Whitney test in Ellsworth). N-MEG (blue) showed the best significance among all the gene sets. 
Instead, N-MEG and a multiple regression set showed relatively low significance in tumor stage data; probably because most of the N stage test 
sets used two-class comparison methods. Values in tumor grade analyses were scaled down to 1/10 for better presentation of the graph. N-MEG 
= N-wise monotonically expressed genes, MR = multiple regression, 2ANOVA = two-way ANOVA. 




Stage M 



—•—Fisher's Chi Square — ■— Stouffer's Overall Z Cumulative Binomial 

Figure 2 P-values of N-MEG over ten analysis types. Each line 

denotes a different kind of meta-analysis method. Except the 'stage 

M' class, all the p-values are less than 0.01 (and extremely low in 

prognosis analyses). Each p-value was calculated by multiplying two 

p-values from 20 up and 20 down genes, 
v J 



denoted in Figure 5B. The AUC of MEGA was 0.69 (0.626 
~ 0.757 in 95% confidence limits). To conduct a survival 
analysis we divided the all 286 patients into three groups 
of equal size (n=95 for good and poor group, 96 for inter- 
mediate group). It is shown that the three groups have 
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Figure 4 Aspects of gene expression patterns. Top ranked genes from four different methods are shown (A: N-MEG, B: multiple regression, C: 
two-way ANOVA, D: Suzuki set, E: Ellsworth set, and G: Abba set). Genes in the first row are up-regulated genes in lymph node positive samples 
in each study, the second row are down regulated genes (except in B; ANOVA does not give directional results). Lymph node progression is 
denoted in X-axis, and the relative expression values against normal breast samples are denoted in Y-axis (in log 2 based fold change). Four 
different colors are used to discriminate different tumor sizes (T stage 1 to 4). Expression patterns of two genes from N-MEG and multiple 
regression (FLJ32549, AKR1D1, TMEM65 and MGC35361) show consistent increase (top row) or decrease (bottom row) in all tumor sizes, while 
top-ranked genes from other studies (D~F) show either inconsistency among different tumor sizes (D up, E and F up) or disagreement over 
phenotypic traits (D down, E down, and F up). We can see that the multiple regression model find more linear relationships (B) and ANOVA set 
contains significant fluctuations (C). 

V J 



improving patients' survival rate. Although cancer prog- 
nosis is a result of complex and stochastic activities 
among cellular processes, we can conclude this tendency 
would be valid for other non-essential organs. 

Genes related to specific N stage transitions 

We further tested the significance of genes related to 
specific steps of lymph node invasion progression by 
altering a leaping factor p. The leaping factor was set to 
10 and applied to three different steps (NO— »N1, 
N1^N2, and N2^>N3) with a factor remains to be 
zero. These gene sets are N-wise monotonically 
expressed genes with a leaping at a specific stage transi- 
tion Ah>B (N-MEG a ^ b : here, N-MEG 0 ^ 1 , N-MEG 1 ^ 2 , 
and N-MEG 2 ^ 3 respectively). Interestingly, we found 



significant discrepancy among lymph node invasion pro- 
gression steps (Table 2). Genes which were significantly 
up or down regulated in the Nl— »N2 progression (N- 
MEG 1 ^ 2 ) were of no significance in most of the prog- 
nosis and tumor stage studies (p-values > 0.01). Instead, 
N-MEG 0 ^ 1 and N-MEG 2 ^ 3 were significant in most of 
the studies including prognosis of metastasis, prognosis 
of relapse, prognosis of survival and tumor grade. 

Firstly, we expected that N-MEG 0 ^ 1 would be more 
informative than N-MEG in the other stage transitions. 
Because, it is thought that if a set of tumor cells acquire 
high motility to migrate and intravasate into lymph 
nodes, dissemination of tumor cells over the larger parts 
of lymph nodes would follow spontaneously [5], But the 
result of meta-analysis test represents that there would 
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Figure 5 Classification and survival analysis using N-MEG genes. (A) An ROC curve shows that the genes from MEGA analysis (N-MEG) have 
the strongest classification power among other candidate lymph node related genes. (B) Areas under ROC curve (AUC) denote the statistical 
measurement of the classification power. Note that the genes from previous two-class comparisons hardly prove their usefulness in classification; 
but the result may differ when conducted in other classification methods. (C and D) Three risk groups (good, intermediate and poor) show 
distinct survival and hazard functions in Kaplan-Meyer diagrams. AUC=area under curve, Std. Error=standard error, Asym. Prob.=Asymptotic 
probability, LCL=lower confidence limit, UCL=upper confidence limit. 



be another transcriptional changing event in the late 
step of lymph node invasion before raising distant 
metastasis. 

To inspect the characteristics of N-MEG 0 ^* 1 and N- 
MEG 2 ^ 3 , we chose 200 genes from each gene set (100 
positive and 100 negative genes in N-MEG 0 ^ 1 and N- 
MEG 2 ^ 3 ) and compared them each other. We found 
that there were few overlaps between two gene sets; no 
overlap in top 20 genes, and only two overlaps in 200 
genes. But in the gene function analysis using Gorilla 
[24], both gene sets were enriched in the immune 
response GO terms (p-values ~ l.OxlO" 4 ). Where the 



immune response is a well-known process affecting 
lymph node invasion [25-27], it is convincing that both 
gene sets are distinct but closely related to lymph node 
invasion by connected pathways (see Additional Files 1 
and 2 for full enrichment map). 

Pathway analysis of N-wise progression 

To observe changes of pathways in N-wise progression, 
we applied Gene Set Enrichment Analysis (GSEA) [28] 
to N-MEG. All 20,073 genes were sorted by their LE' 
scores in descendant order. And the sorted list was ana- 
lyzed by the GSEA Preranked test using 1,186 curated 



Table 1 Comparison of N-MEG and T-MEG in a meta-analysis test 



Analysis 


N-MEG (n=40) 




T-MEG (n=40) 
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1 .1x1 0" 9 


Stage M 


0.38 


0.28 


0.91 


2.8x1 0" 4 


Stage N 


0.017 


0.10 


0.15 


0.57 


Stage T 


2.1 x10" 3 


0.074 


0.098 


0.15 


Stage G 


0.063 


7.2x1 0" 3 


0.93 


0.18 


Overall Stage 


2.1 x10" 4 


1.2x10" 3 


0.46 


0.075 


Tumor Grade 


1.3X10" 104 


1.3x10" 8 


2.3x1 0" 8 


6.0x1 0" 10 



N-MEG showed better significance than T-MEG. Generally speaking, N-wise progression gives more information than T-wise progression in cancer prognosis. 
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Table 2 Comparison of stage transition specific genes 



Analysis 


N-MEG 0 ^ 1 


(n=40) 


N-MEG 1 


(n=40) 


N-MEG 2 ^ 3 


(n=40) 




+ 




+ 




+ 




MP 


2.8x1 0" 4 


0.13 


0.24 


0.13 


5.7X10" 12 


3.8x1 0" 11 


RLP 


7.7x1 0" 7 


1.0X10" 15 


0.13 


0.084 


4.8x1 0" 11 


2.3x1 0" 6 


SVP 


5.9x1 0" 4 


0.03 


0.75 


0.53 


1.2X10" 10 


1.4X10" 5 


OVP 


1.2X10" 11 


9.3x1 0" 1 3 


0.24 


0.08 


7.8x1 0" 29 


2.0x1 0" 1 7 


STM 


0.042 


0.14 


0.97 


0.30 


0.043 


0.035 


STN 


0.030 


3.5x1 0" 6 


0.04 


0.068 


9.4x1 0" 4 


0.06 


STT 


1.8X10" 3 


2.4x1 0" 10 


0.12 


0.069 


5.9x1 0" 7 


0.75 


STG 


1.1X10" 5 


6.9x1 0" 9 


0.01 


0.015 


0.11 


0.53 


Overall Stage 


6.6x1 0" 8 


2.3x1 0" 20 


4.2x1 0" 3 


1.4X10" 3 


7.0x1 0" 8 


0.15 


Tumor Grade 


2.1X10" 16 


2.9x1 0" 22 


7.2x1 0" 6 


0.48 


1.1X10" 42 


1.7X10" 58 



N-MEG 2 ^ 3 ((3=10) showed the best significance in cancer prognosis test sets. While N-MEG°^ 1 was also significantly associated with prognosis and some stages, 
N-MEG 1 ^ 2 showed surprisingly insignificant results. MP= prognosis of metastasis, RLP=prognosis of relapse, SVP=prognosis of survival, OVP=overall prognosis, 
STM=stageM, STN=stageN, STG=stageG. 



MSIGDB gene sets (C2-CGP: chemical and genetic per- 
turbations). In the N-MEG+ genes, we found that top 
ranked enriched pathways were closely related to T cell 
activities, cell differentiation and wound healing path- 
ways (Table 3). It is previously known that immune 
response has dual role in tumor initiation and progres- 
sion (reviewed in [26]) - inhibition of tumor growth by 
antitumor cytotoxic T-cell activities (reviewed in [29]), 
and promotion of tumorigenesis, invasion and metasta- 
sis by arising chronic inflammatory environment 
[30-32]. Recently, DeNardo et al found that CD4+ T 
cell promotes lung metastasis of breast cancer through 
macrophages [33]. These evidences are in concordance 
with our N-MEG+ result by supporting that mainte- 
nance or increase of N-wise gene expression is closely 
correlated with lymph node invasion and poor prog- 
nosis. It is also well known that a serum response of 
fibroblasts including wound healing pathways efficiently 
predicts cancer progression [34]. Other two gene sets in 
the Table 3 are from previously studied result about 
general differentiation of tumor cells (tumor grade) and 
prognosis, which support that the N-MEG+ are nega- 
tively correlated with differentiation and prognosis. 

Common TF binding site prediction for N-MEG 

Because the N-MEG+ and N-MEG- are already selected 
from their expression patterns along the lymph node 



invasion progression, we can hypothesize that they are 
co-regulated by several core transcription factors. To 
find candidate common transcriptional regulators, we 
analyzed upstream regions of the N-MEG and selected 
significantly over-represented motif binding sites using 
Pscan program [35]. We selected top 20, 30, and 50 N- 
MEG and obtained matching mRNA RefSeq sequences 
from DAVID database [36,37]. By running Pscan on 
[-450, +50] upstream regions onto the JASPAR database 
[38], we found that ELK4 and ELK1 binding sites were 
significantly over-represented (p-values 7.8xl0" 7 and 
8.1 xlO" 6 respectively, data shown in Figure 6). ELK4 and 
ELK1 (E26 Like Transcription Factor) are previously 
known as members of ternary complex factor (TCF) 
subfamily, which forms a ternary complex by binding to 
the serum response factor and the serum response ele- 
ment in a promoter region of the c-fos proto-oncogene 
[39] (SAP1 is a previous name for ELK4). This finding 
supports the results of the pathway analyses in the pre- 
vious section implicating that ELK4 based serum 
response mechanism might be a driving force for breast 
cancer lymph node invasion and metastasis. 

We also conducted the same analysis with N-MEG 0_>1 
and N-MEG 2 ^ 3 ((3=10). While there was no significant 
common binding sites in N2^N3 progression, we found 
STAT1, IRF2, and IRF1 can be common binding tran- 
scription factors in 50 N-MEG 0 ^ 1 (p-values of 4.4x10" 



Table 3 Enriched pathway analysis of IM-MEG+ using GSEA 



R 


Gene Set Name 


Size 


ES 


NES 


P 


FD 


Gene Set Description 


1 


WIELAND_HEPATITIS_B_INDUCED* 


96 


0.62 


3.10 


0 


0 


Up-regulated with adaptive T cell activities in viral clearance 


2 


LEE_TCELLS3_UP* 


103 


0.63 


3.05 


0 


0 


Up-regulated in immature T cell in CD4+ T cell differentiation 


3 


CANCER_UNDIFF_META_UP @ 


69 


0.65 


3.05 


0 


0 


Up-regulated in undifferentiated tumor cells 


4 


SERUM_FIBROBLAST_CELLCYCLE & 


137 


0.59 


3.02 


0 


0 


Up-regulated in serum response of fibroblasts (wound healing) 


5 


BRCA_PROGNOSIS_NEG @ 


95 


0.62 


2.98 


0 


0 


Up-regulated in breast tumor cells of negative results (prognosis) 



Top five highly enriched pathways are shown. * T cell mediated immune response pathways, & Serum response related pathway, @ Gene sets previously known 
as bad prognosis. R=rank, P=nominal p-value, FD=FDR q-value. 
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TF 
name 



Description 



P-value 
(Bonferroni corr.) 



Consensus 



N-MEG+: 
ELK4 



Ternary complex factor (TCF) 7.82^1 0" 7 

subfamily binding to serum response (7.59x1 0" 5 ) 
factors 



ELK1 TCF subfamily. 8.1 *10" 6 

Nuclear target for the ras-raf-MAPK (7.8x 1 0 4 ) 
signalling cascade 



STATl Response to cytokines and growth 4.4x10 

factors (4.3x10") 
Activated by IFNa, IFNy, EGF, 
PDGF and IL6 

IRF2 Interferon regulatory factor 2 1 . 1 x 1 0" 12 

Transcription activator of histone H4 (1.1 xlO" 10 ) 



IRF1 Interferon regulatory factor 1 7.4x10 

Inhibited by IRF2 (7.2xl0- |() ) 




m III. 



3 





Figure 6 Candidate commonly binding transcription factors. Using Pscan, 50 top-ranked N-MEG+ and N-MEG°^ 1 + were analysed in their 
upstream sequences. In N-MEG+, transcription factors involved in serum response activities have been identified. In N-MEG°^ 1 +, IRF1-IRF2 
mediated tumor suppressing pathways were identified as candidate driving pathways. 



13 ,l.lxl0~ 12 , and7.4xl0~ 12 respectively, data shown in 
Figure 6). We found that IRF1, who plays a tumor sup- 
pressing role, is negatively regulated by competitive 
transcriptional binding of IRF2, both of which were sig- 
nificantly correlated with tumor stage (p-value 0.001), 
depth of tumor infiltration (p-value 0.006) and lymph 
node metastasis (p-value 0.015) in human esophageal 
cancers [40]. Here, we also suggest that IRF2-IRF1 path- 
way is likely to be involved in lymph node invasion and 
metastasis progression in human breast cancers with 
well-known activities of STATl [41-44]. 

Discussion 

In this study, we proposed a monotonically expressed 
gene analysis (MEGA) for extracting genes that are 
related to lymph node invasion and tumor size in breast 



cancer. We analysed expression patterns over a two 
dimensional NxT space and provided results of meta- 
analysis to evaluate the gene sets. The test has been con- 
ducted on completely independent data sets. We showed 
that gene sets selected from the suggested LE' and TE' 
functions are strongly correlated with cancer prognoses 
including metastasis, relapse and survival, and showed 
significantly better results than conventional approaches. 
These functions are specially designed to capture expres- 
sional differences between two consecutive stages and 
consistency of expression patterns as well. The MEGA 
model also enabled us to analyze the impact of each clini- 
cal factor independently, and to inspect a specific stage 
transition in a cancer progression. 

Before concluding our report, it is necessary to recon- 
sider the meaning of linking clinical factors and cancer 
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prognosis in a molecular level. A general relationship 
among clinical and genetic factors is described in Figure 
7. Since a primary tumor first occurred, accumulated 
genetic and epigenetic aberrations drive the tumor's 
progression. The relationship between genetic aberra- 
tions and gene expression changes or protein function 
aberration is strongly established. So it is obvious that 
progressed tumors have different gene expressions or 
abnormal proteins. In our MEGA analysis, we focused 
on the gene expression part. If we found a candidate 
gene set which connects gene expression changes and 
lymph node invasion (A in Figure 7), it should be able 
to explain the relationship to cancer prognosis (C in 
Figure 7), because the correlation between lymph node 
invasion and cancer prognosis has been firmly proven 



(B in Figure 7). We found that previous candidate genes 
in A rarely found proper explanations of C. Here, the 
meaning of our study is summarized in two points. 
First, we tried to improve accuracy in finding candidate 
genes in A by interpreting gene expression patterns over 
lymph node progression (MEGA). Second, we provided 
a credible meta-analysis test procedure to validate the 
relationship in C. As there still unexplained important 
factors remain, we have to integrate additional data of 
other levels to finalize the lymph node invasion related 
(or causing) genes. But we suggest the future work also 
should be cross-validated in the different types of 
relations. 

Although the MEGA analysis provided a feasible link 
to clinical factors and cancer prognoses in a genetic 




— Relation is rarely proven in a genetic level 

Relation is predictable with genome-wide data 
Relation is proven with various evidences 



Sequence level data I 
Protein level data 



Transcription level data 
Clinical data 



Figure 7 Conceptual relationship of clinical and genetic factors. Colors in the box denote different levels of data. Lines are drawn in 
different shapes according to different degrees of certainty. After primary tumor occurred, accumulated (epi)genetic mutations drive cancer 
progression. With transcriptional level data, we can elucidate lymph node invasion (A) and cancer prognosis (C). But in the prediction steps, the 
previously proven relation between D and F should be properly explained (B). 
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level, some parts remain to be improved. First, the 
monotonicity can be defined in various ways. Currently 
we can rarely determine the activities of genes in an 
absolute expression level. Some genes have higher 
saturation level so that the gene expression pattern 
might show a monotonic increase through all the clini- 
cal stages. On the other hand, if an activity of a gene is 
easily saturated, the gene expression pattern above a 
certain degree would not be informative anymore. 
Handling and determining the optimized pattern on 
every gene is almost impossible, but other heuristics will 
be available using kinetics and text-mining. Second, 
integration with other omics data always count. It is still 
a big question; how to connect two different types and 
levels of information. As shown in Figure 7, protein 
level information explains another big part of clinical 
outcomes. In our opinion, those information sets should 
be integrated to an augmented 'gene' entity with other 
available information like point mutations, SNPs, and 
CNVs. In this case, the MEGA model has to be revised 
in its scoring functions. Lastly, finding for driving 
mechanisms of progression is one of the ultimate goals 
in this field. We tried to elucidate these mechanisms 
through pathway analyses and commonly binding tran- 
scription factor analyses here, but it is yet to be a strik- 
ing discovery. After we solve the prior questions, our 
approach might be more helpful in clarifying the core 
genes or genetic events that can be essential in thera- 
peutic applications. 

Methods 

Data sets 
Training data sets 

We used 278 breast tumor gene expression data from the 
expO (expression project for Oncology) database (http:// 
www.intgen.org/expo.cfm, International Genomics Con- 
sortium). The data can be also downloaded from NCBFs 
GEO database (GSE2109). From 2,158 gene expression 
profiles for all tissues and tumor types, we chose only 
breast carcinomas. Samples without pathological N and 
T stage records, or whose pathological M stage and histo- 
logical information indicate inclusion of distant metasta- 
sis were removed. Finally, the 278 non-metastasis breast 
tumor samples were categorized into 16 NxT classes 
(N0-N3, T1-T4). We also used seven normal breast 
gene expression profiles from GSE3744 [45] to infer the 
deviation of each NxT stage against the normal condi- 
tion. Normalization of data was processed with the Sim- 
pleaffy Package [46] in R by applying RMA normalization 
for every NxT stage with normal breast samples. After 
normalization, probe sets were collapsed into gene sym- 
bols using the GSEA collapse tool [28]. Each gene was 
scored by log 2 based fold change. 



ONCOMINE data sets for test 

For test sets, we used 65 breast cancer data sets from 
ONCOMINE database [47]. The 65 data sets were firstly 
classified into three major analysis types (Prognosis, 
Tumor Stage, and Tumor Grade) each of which is 
further classified into matching minor subtypes. Minor 
typing has been done by manual inspection. From the 
ONCOMINE database, we could download pre-analyzed 
tables which include sample size, statistics, and two- 
tailed p-values. Two-tailed p-values were further con- 
verted into one-tailed p-values. 

Monotonically expressed gene analysis (MEGA) 

In this work, we use two clinical variables (pathological 
T and N), but MEGA can be expanded to three or more 
variables with similar procedures. The final goal of the 
MEGA analysis is to extract N-wise monotonically 
expressed genes (N-MEG) and T-wise monotonically 
expressed genes (T-MEG) using monotonicity functions. 
We first define a two dimensional NxT space for each 
gene. The NxT space for a gene g y consists of p num- 
bers of N stages and q numbers of T stages can be 
defined as a (p+l)x(q+l) matrix X: 



f \ 

Xy 0Q ... Xy 0 q 



x(#) ^ 



typo •" 



pq 



where the first row and the first column denote gene 
expressions of normal samples. 

To represent how consistently a series of gene expres- 
sions has changed along N and T axes, we defined two 
scoring functions of the X matrix: 

p q fe-1 q 
LE(gy) = (3kj{ (Xyki - Xyji) + a S(Xyki - Xyji) } + (Xyli - XyOi) 

k=2 i=l j=l i=l 

q p fe-1 p 
TE(gy) = f3kj{ (Xyik - Xyij) + a S(Xyik - Xyij) } + (Xyil - Xyio) 

k=2 t=l j=l j=l 

where LE is a monotonicity function of gene expres- 
sion over lymph node invasion progression and TE is 
monotonicity function of gene expression over tumor 
size growth. The parameter a is a consistency factor 
which emphasizes the direction of gene expression 
changes, S is a sign function, and p is a leaping factor 
for giving weights to an specific step of stage progres- 
sion. The characteristics of parameter a were not 
explored in this study. The sign function S is defined 



[-1 •••(#< 



o) 
o) 
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And a matrix of leaping factors p is defined as a 
matrix form: 



P = 



Pio 



1 



Pp{p-i 



Because P is only meaningful in two consecutive 
stages, the P matrix is much like a diagonal matrix. The 
off-diagonal entries are always defined as 1. In this 
study, we applied the leaping factors only for N stage 
progression. 

We can interpret the value of LE and TE functions as 
a sum of moving deviations in the course of N and T 
stage progression. For each stage, we calculate the gene 
expression differences between the current stage and 
previous stages from the beginning of stage to the one 
step before the current stage. If a certain gene shows a 
monotonic increase or decrease in its gene expression 
along the N or T stages, the absolute value of the func- 
tion would be larger. 

Finally the LE and TE functions are normalized by the 
overall standard deviation of the X matrix: 



LE'(gy) = 
TE'(gy) = 



LE(gy) 
<7(X(Sy)) 

TE(gy) 
<r(X(»0) 



We selected top 40 genes for each monotonicity func- 
tion (LE' and TE') with their absolute scores. The 40 
genes are composed of 20 genes of high score and 20 
genes of low score. The genes of high LE' score means 
that the expression of those genes showed monotonic 
increase as the lymph node invasion progresses, so the 
set of the genes is named N-wise monotonically 
expressed genes with positive correlation (N-MEG+). 
Similarly, N-wise monotonically expressed genes with 
negative correlation (N-MEG-) and two other gene sets 
on a tumor size factor (T-MEG+ and T-MEG-) were 
defined. 

Meta-analysis test on ONCOMINE data set 

We performed meta-analysis tests on 65 ONCOMINE 
data sets with the selected N-MEG and T-MEG. 
Assume that we have a gene set G = {gi, g2-..g n } and an 
experiment set E = {ei, e 2 ...eij. For a gene i and experi- 
ment j, we can extract a p-value of gene i in the experi- 
ment j from the ONCOMINE data set: 

pij = p-value of gene i in experiment j (for i < n, j < k) 



Because the is basically two tailed p-value in the 
original data sets, we converted these p-values into one- 
tailed p-values: 



pij = 



• (same direction) 



pij 



1 - -y- • • • (opposite direction) 



During this procedure, some genes were missed due to 
the different naming strategies and the difference of cover- 
age among the test sets. To minimize the loss of informa- 
tion, we searched for all aliases and symbols of earlier 
versions using the recent version of HUGO Gene Nomen- 
clature Committee (HGNC) database (2009/08/23). 

Three meta-analysis methods have been applied to calcu- 
late overall p-values for a certain set of experiments - Fish- 
er's inverse chi-square [48], Stouffers overall Z [49], and 
cumulative binomial distribution. Here, the Fishers inverse 
chi-square method computes a combined statistic using, 



!ln(nW) 



which follows a % 2 clistribution with 2nk degrees of 
freedom under the joint null hypothesis [50]. 
Unweighted Stouffers Z was calculated by transforming 
every p-value into z-score upon the standard normal 
distribution, followed by summing up all z values and 
dividing by square root of the total numbers: 

\/pij' Zij in 8(0, l 2 ) 



Here, the K is a standard normal distribution, Z is 
the sum of all z-values. For the cumulative binomial 
method, we first set a threshold to determine whether a 
given p-value is significant or not. From the total nk 
numbers of p-values, we count the significant p-value 
number n s . For given a threshold p h , a probability that 
one can get a number of p-values equal to or more than 
n s incidentally is, 

1 - F(ns) nk, ph) 
= 1 - Pr(X < ns) 



i 

i=0 V 



nk 



ph\l-ph) 



nk-i 



which can be approximated using incomplete beta 
function. 



Kim et al. BMC Systems Biology 201 1, 5(Suppl 2):S4 
http://www.biomedcentral.eom/1752-0509/5/S2/S4 



Page 13 of 14 



j=a 

So, the final p-value is calculated like below. 
p = 1 - l!_p(nfe - ris - 1, ns + 1) 



Correction of Z scores from background biases 

Before finalizing p-values of meta-analyses, we noticed 
that the p-values in the ONCOMINE data sets are 
upwardly biased. In the 65 studies, 55 studies have big- 
ger number of significantly changed genes than we 
expected. And we also found that 43 studies have more 
up-regulated genes than down-regulated genes. We do 
not insist that these results mean experimental errors; 
we would rather think it is natural that many of genes 
are going to be actively expressed as the cancer pro- 
gresses and regulatory mechanisms are being broke 
down. But in the case of test procedures, we are likely 
to get more false positives unless we consider the back- 
ground biases. For example, some random gene sets 
may represent 'more than average' results and will be 
thought to be significant for cancer phenotypes. 

To correct the background biases, we generated 1,000 
random gene sets (n=20) and tested on the ONCO- 
MINE data sets. And we computed Stouffer's overall Z 
score on each random gene set. Averaged Z scores 
represent an expected Z score from a random gene set. 
From this result, we could conclude that the meta-ana- 
lysis result of gene sets look more up-regulated in the 
cancer progression than they really are. So we corrected 
all the Stouffers Z scores result from N-MEG and T- 
MEG by subtracting the mean values of Z in the up-reg- 
ulation test and adding in the down- regulation test. 

Comparing with previous studies 

We first extracted lymph node invasion related gene sets 
from previous studies (Suzuki et al [17], Abba et al [18], 
and Ellsworth et al [19]). Each gene set was tested using 
the corrected Stouffers Z test described in previous sec- 
tion. We found that the gene set from Abba et al was 
already reduced from 300 to 46 genes using eight prog- 
nosis experiments. So the Abba set was not used in 
further comparisons. We also found there are four 
experiments which used expO data set in the ONCO- 
MINE test set (Bittner et al). All the overlapping data 
was excluded in the test procedure. Additionally, we 
selected 40 genes from expO data using two-way 
ANOVA and multiple regression models. For each gene, 
both models were constructed using aov and Im' func- 
tions in R. In ANOVA models, genes were sorted by N 



stage dependent two-tailed p-values derived from their 
F-statistics because of the model's non-directionality. In 
multiple regression models, 40 genes with the highest P- 
values of N stage were selected where their T stage and 
interaction terms are not significant. Directions of regu- 
lation were determined from the estimated coefficients 
(up >0, down <0). Finally five gene sets (N-MEG, multi- 
ple regression set, two-way ANOVA set, Suzuki set, and 
Ellsworth set) were tested and their p-values were 
reported. P-values from meta-analysis were converted 
into log-odd score using -log 10 (P). For each study, the 
final score was calculated from adding the two log-odd 
scores (from up and down regulated gene sets). 

Additional material 



Additional File 1: Analysis of enriched GO Term with GOrilla (N- 

MEG 0- * 1 ). Genes related to N0->N1 stage transition showed significant 
over-representation with collagen and extracellular matrix constituent. 

Additional File 2: Analysis of enriched GO Term with GOrilla (N- 

MEG 2- * 3 ). Genes related to N2^N3 stage transition showed significant 
over-representation with immune response, wound response and 
inflammation. 
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